PADIC: extension and new experiments

Author: Harrat S
Meftouh K.
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 28/04/2018

PADIC is a multidialectal parallel Arabic corpus. It was initially composed of five Arabic dialects, three from the Maghreb and two from the Middle East, in addition to standard Arabic. In this paper, we present an augmented version of PADIC with a Moroccan dialect. We also give an evaluation, using the σ–index, of the computerization level of the Arabic dialects present in PADIC, which shows that these languages are indeed under-resourced. Several machine translation experiments, in both directions for all language pairs, are discussed as well. For each language, we interpolated the corresponding Language Model (LM) with an LM built from a large Arabic corpus. The results show that this interpolation has no effect on translation performance in some cases and degrades it in others.

INRIA a CCSD electronic archive server
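
The language model interpolation mentioned in the PADIC abstract can be illustrated with a minimal sketch. It assumes plain linear interpolation of two unigram models with a weight `lam`; the abstract does not state the n-gram order, the toolkit, or the weights used, so the function names, the toy token lists, and the value `lam=0.7` below are all hypothetical.

```python
from collections import Counter


def unigram_lm(tokens):
    """Maximum-likelihood unigram probabilities from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(p_small, p_large, lam):
    """Linear interpolation: lam * p_small(w) + (1 - lam) * p_large(w)."""
    vocab = set(p_small) | set(p_large)
    return {
        w: lam * p_small.get(w, 0.0) + (1 - lam) * p_large.get(w, 0.0)
        for w in vocab
    }


# Toy data standing in for a small dialect corpus and a large Arabic corpus
# (hypothetical placeholders, not taken from PADIC).
dialect_lm = unigram_lm("a b a c".split())
large_lm = unigram_lm("a b b d".split())

mixed = interpolate(dialect_lm, large_lm, lam=0.7)
```

Because each input model sums to one, the interpolated probabilities also sum to one for any `lam` between 0 and 1. In practice the weight is usually tuned on held-out data, which is one way the interpolation can turn out neutral or harmful.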

Cross-Lingual Semantic Similarity Measure for Comparable Articles

Author: A. Fujii
K. Meftouh
M.L. Littman
M.W. Berry
N. Habash
S. Deerwester
T.K. Landauer
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

In this research, we aim to find and compare cross-lingual articles concerning a specific topic, which requires a similarity measure. This measure can be based on bilingual dictionaries or on numerical methods such as Latent Semantic Indexing (LSI). In this paper, we use LSI in two ways to retrieve Arabic-English comparable articles. The first is monolingual: the English article is translated into Arabic and then mapped into the Arabic LSI space. The second is cross-lingual: Arabic and English documents are mapped into a joint Arabic-English LSI space. We then compare the LSI approaches to the dictionary-based approach on several English-Arabic parallel and comparable corpora. Results indicate that the cross-lingual LSI approach is competitive with the monolingual approach, or even better for some corpora. Moreover, both LSI approaches outperform the dictionary approach.

Crossref

INRIA a CCSD electronic archive server

Institutional Repository of the Islamic University of Gaza
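
As a rough illustration of the dictionary-based baseline named in the abstract above, the sketch below translates source tokens through a bilingual dictionary and compares bag-of-words vectors with cosine similarity. This is an assumption about what such a measure could look like, not the authors' implementation; the function names and the tiny transliterated dictionary are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dictionary_similarity(src_tokens, tgt_tokens, bilingual_dict):
    """Translate source tokens word by word, dropping words missing from
    the dictionary, then compare with the target document."""
    translated = [bilingual_dict[t] for t in src_tokens if t in bilingual_dict]
    return cosine(Counter(translated), Counter(tgt_tokens))


# Hypothetical English -> transliterated Arabic entries, for illustration only.
toy_dict = {"book": "kitab", "house": "bayt"}

same_topic = dictionary_similarity(["the", "book", "house"], ["kitab", "bayt"], toy_dict)
unrelated = dictionary_similarity(["the", "book", "house"], ["qalam"], toy_dict)
```

Words missing from the dictionary are silently dropped here, which hints at why dictionary coverage can limit this kind of measure compared with the LSI approaches.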

Constitution d'un corpus de la langue Arabe à partir du Web

Author: Laskri Med Tayeb
Meftouh K.
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 09/10/2007

The web is an inexhaustible source of textual data. In recent years, the community working on the various aspects of language has turned to the web to take advantage of this impressive mass of information. This article describes a corpus construction tool for Arabic. It automatically collects a list of websites dedicated to the Arabic language. The content of these sites is then extracted and normalized. The resulting corpus can be used in various natural language processing applications, in particular for computing statistical language models.

INRIA a CCSD electronic archive server

Hal-Diderot